GDist-RIA Crawler: A Greedy Distributed Crawler for Rich Internet Applications
Abstract
Crawling web applications is important for indexing, accessibility, and security assessment. Crawling traditional web applications is an old problem for which good and efficient solutions are known. Crawling Rich Internet Applications (RIAs) quickly and efficiently, however, remains an open problem. Technologies such as AJAX and partial Document Object Model (DOM) updates make crawling RIAs even more time-consuming for the web crawler. One way to reduce the crawl time is to crawl a RIA in parallel with multiple computers. The previously published Dist-RIA Crawler presents a distributed breadth-first search algorithm to crawl RIAs. This paper extends the Dist-RIA Crawler in two ways. First, it introduces an adaptive load-balancing algorithm that enables the crawler to learn the speed of the nodes and adapt to changes, and thus better utilize the resources. Second, it presents a distributed greedy algorithm to crawl a RIA in parallel, called the GDist-RIA Crawler. The GDist-RIA Crawler uses a server-client architecture in which the server dispatches crawling jobs to the crawling clients. This paper describes a prototype implementation of the GDist-RIA Crawler, explains some of the techniques used to implement the prototype, and examines empirical performance measurements.
Related resources
Recombination Operators in Genetic Algorithm - Based Crawler: Study and Experimental Appraisal
A focused crawler traverses the web, selecting relevant pages according to a predefined topic. While browsing the internet, it is difficult to identify relevant pages and predict which links lead to high-quality pages. This paper proposes a topical crawler for Vietnamese web pages using greedy heuristic and genetic algorithms. Our crawler based on genetic algorithms uses different recombinati...
Building Data-Intensive Grid Applications with Globus Toolkit - An Evaluation Based on Web Crawling
Nowadays, there is a trend to create resource-consuming applications not by building large computer centers but by using resources on computer systems distributed over the internet. Grid middleware is a framework to access these resources. The concern of this paper is the evaluation of a specific grid middleware, namely Globus Toolkit, for data-intensive applications. As a test case, we have de...
Enabling automatic testing of Modern Web Applications using Testing Plug-ins
Modern web applications are very dynamic in nature and offer a rich user experience. Such applications typically use Web 2.0 and Asynchronous JavaScript and XML (AJAX) technologies. These applications are very different from conventional web applications, as they use stateful C/S communication in an asynchronous fashion. The user agent is able to communicate with the web server without explicit form submiss...
Web Crawler: A Review
Information Retrieval deals with searching and retrieving information within documents; it also searches online databases and the internet. A web crawler is defined as a program or software that traverses the Web and downloads web documents in a methodical, automated manner. Based on the type of knowledge, web crawlers are usually divided into three types of crawling techniques: General Purpo...
Search optimization technique for Domain Specific Parallel Crawler
The architectural framework of the World Wide Web is used for accessing linked documents spread out over millions of machines all over the Internet. The Web is a system that makes the exchange of data on the internet easy and efficient. Due to the exponential growth of the web, it has become a challenge to traverse all URLs in web documents and handle these documents, so it is necessary to optimize the paralle...